W5: Iterating tasks

Iterating tasks

Suppose that you want to repeat a chunk of code many times, changing one variable’s value each time. This could mean modifying each element of a vector in the same way, or analyzing a dataframe multiple times with different parameters.

Iterating tasks: solutions

  1. Copy and paste the code chunk, change that variable’s value, and repeat. This can be a starting point in your analysis, but it leads to errors easily.
  2. Functionals (the apply and map families of functions) allow you to take a function that solves the problem for a single input and generalize it to handle any number of inputs. This is very popular in R programming culture.
  3. Use a for loop to repeat the chunk of code, and let it loop over the changing variable’s value. This is popular in many programming languages, but the R programming culture encourages a functional approach instead.

Review of lists

  • Remember, lists are the most general data structure
  • We can put anything into them
    • Here, we focus on lists that hold the same kind of thing in each slot
    • For example:
      • a list of file paths
      • a list of data.frames
      • a list of plots
  • Ideally, we want to apply the same function to them.

Functionals via map()

map() takes a vector or a list and applies a function to each of its elements. The output is always a list.

map(my_vector, log)
    ^vector   ^function
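
As a minimal sketch of the call above (assuming the purrr package is installed, and using a made-up my_vector), the result has one list slot per input element:

```r
library(purrr)

my_vector <- c(1, 3, 5, 7)

# map() calls log() once per element and collects the results in a list
log_list <- map(my_vector, log)

# str() shows a list of 4 numeric values, one per input element
str(log_list)
```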

purrr::map()

Another View of map()

The basic formula

  1. Define what you want to do
  2. Do it once on test data, write function if necessary
  3. Make a list of objects to iterate through
  4. Apply function multiple times on the list elements

1. Loading up a list of files

2. Do it once

We want to load up four data frames in the data/tumor/ directory using read_csv. Let’s try doing it for one first:

3. Make a list

Now we build our list by listing the files in data/tumor/:

4. Do it multiple times

Now we can load all of the files by applying read_csv to each element of file_list.

Check what df_list[[1]] is:
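
Steps 2 to 4, ending with that check, can be sketched end to end as follows. This is only an illustration: so that it runs anywhere, a temporary folder with two tiny made-up CSV files stands in for data/tumor/, and the file names and values are hypothetical. It assumes the purrr and readr packages.

```r
library(purrr)
library(readr)

# Stand-in for data/tumor/: a temporary folder with two made-up CSV files
tumor_dir <- file.path(tempdir(), "tumor")
dir.create(tumor_dir, showWarnings = FALSE)
write_csv(data.frame(age_at_diagnosis = c(60, 71)),
          file.path(tumor_dir, "example_a.csv"))
write_csv(data.frame(age_at_diagnosis = c(55, 64)),
          file.path(tumor_dir, "example_b.csv"))

# Step 2: do it once on a single file
one_df <- read_csv(file.path(tumor_dir, "example_a.csv"), show_col_types = FALSE)

# Step 3: collect the file paths; full.names = TRUE keeps the folder in each path
file_list <- list.files(tumor_dir, pattern = "\\.csv$", full.names = TRUE)

# Step 4: apply read_csv to every path; the extra argument is passed to each call
df_list <- map(file_list, read_csv, show_col_types = FALSE)

# The first element is a data frame (a tibble)
df_list[[1]]
```

Note that list.files() returns a character vector rather than a list; map() accepts either.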

Plotting our Data Frames

  1. We want to apply a plotting function to every data frame in a list.
  2. Write a function called plot_recurrence. Try it out with a data.frame
  3. Load our data into a list called df_list using read_csv.
  4. Apply our function plot_recurrence to df_list using map()

1. Define What we want to do

2. Write a function, try it out on one DF

Write a function called plot_recurrence that plots days_to_last_followup (x-axis) against age_at_diagnosis (y-axis). Try running it on lusc_data.

plot_recurrence <- function(df){
  ggplot(df) +
    aes(x= days_to_last_followup,
        y= age_at_diagnosis) +
    geom_point()
}

plot_recurrence(lusc_data)

3. Build our list

  • We use read_csv on our list of file paths.

4. Apply our Function to the list

Try applying plot_recurrence() to each element of df_list using map().
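
One way this step might look, as a sketch rather than the reference solution. It assumes purrr and ggplot2 are installed, repeats the plot_recurrence definition so it runs on its own, and uses two small made-up data frames in place of the real tumor data.

```r
library(purrr)
library(ggplot2)

plot_recurrence <- function(df) {
  ggplot(df) +
    aes(x = days_to_last_followup, y = age_at_diagnosis) +
    geom_point()
}

# Made-up stand-ins for the data frames loaded from file
df_list <- list(
  data.frame(days_to_last_followup = c(100, 400, 900),
             age_at_diagnosis = c(61, 58, 70)),
  data.frame(days_to_last_followup = c(50, 700),
             age_at_diagnosis = c(66, 49))
)

# One plot per data frame, returned as a list of ggplot objects
plot_list <- map(df_list, plot_recurrence)

# Display the first plot
plot_list[[1]]
```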

Critical Things to Think about:

  • What is the unit in the list?
  • How do I call a function?
  • How do I call extra arguments?
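
For the last question, here is a small sketch of two ways to supply an extra argument (the values list is made up). Both forms work in purrr; recent purrr documentation appears to lean toward the anonymous-function form.

```r
library(purrr)

# The unit in this list is a numeric vector that may contain NA
values <- list(c(1, 2, NA), c(4, NA, 6))

# Extra arguments placed after the function are passed to every call
means_a <- map(values, mean, na.rm = TRUE)

# The same call written with an anonymous function
means_b <- map(values, function(x) mean(x, na.rm = TRUE))

identical(means_a, means_b)
```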

Variations of map()

To be specific about the output type, use one of the map_*() variants, where * names the output type: map_lgl(), map_chr(), and map_dbl() return vectors of logical values, strings, and numbers respectively.

map_dbl

Expects a single double return value:
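
A minimal sketch, assuming purrr is installed and using a made-up list of numeric vectors. mean() returns one number per element, so the result is a plain numeric vector instead of a list:

```r
library(purrr)

scores <- list(c(1, 2, 3), c(10, 20), c(5))

# Each call to mean() returns a single double, so map_dbl() can
# combine the results into one numeric vector
score_means <- map_dbl(scores, mean)

score_means
# [1]  2 15  5
```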

map_lgl

Expects logical output from each element (TRUE, FALSE):
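
A small sketch along the same lines (purrr assumed, and the threshold of 4 is arbitrary). The function returns a single TRUE or FALSE for each element:

```r
library(purrr)

my_vector <- c(1, 3, 5, 7)

# The anonymous function returns one logical value per element
is_big <- map_lgl(my_vector, function(x) x > 4)

is_big
# [1] FALSE FALSE  TRUE  TRUE
```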

Case study 3: Iterate over different conditions to analyze a dataframe

Suppose you are working with the penguins dataframe:

head(penguins)

and you want to look at the mean bill_length_mm for each of the three species (Adelie, Chinstrap, Gentoo).

The Process

  1. Define what you want to do
  2. Do it once on test data, write function if necessary
  3. Make a list of objects to iterate through
  4. Apply function multiple times on the list elements

Step 1: Define what you want to do

We want to look at the mean bill_length_mm for each of the three species (Adelie, Chinstrap, Gentoo).

Step 2: Do it Once

Adapt the code below into a function. Try it out on the first element of species_to_analyze.

analyze_bill <- function(species_to_analyze){
  penguins_subset = filter(penguins, species == species_to_analyze)
  out <- mean(penguins_subset$bill_length_mm, na.rm = TRUE)
  return(out)
}

analyze_bill("Adelie")

Step 3: Make a list

  • Variable we need to loop through: c("Adelie", "Chinstrap", "Gentoo")

Step 4: Apply function on list elements

  • The looping mechanism, and its output: map_dbl() outputs (double) numeric vectors.

Apply analyze_bill to species_to_analyze:

map_dbl(-------, analyze_bill)

Map family of functions

More info at: https://adv-r.hadley.nz/functionals.html

For loops, briefly

A for loop repeats a chunk of code many times, once for each element of an input vector.

for (my_element in my_vector) {
  chunk of code
}

Most often, the “chunk of code” will make use of my_element.

my_vector = c(1, 3, 5, 7)
for (my_element in my_vector) {
  print(my_element)
}
[1] 1
[1] 3
[1] 5
[1] 7

Loop through the indices of a vector

my_vector = c(1, 3, 5, 7)
seq_along(my_vector)
[1] 1 2 3 4
for(i in seq_along(my_vector)) {
  print(my_vector[i])
}
[1] 1
[1] 3
[1] 5
[1] 7

To store results, create a vector of the right length first, then fill it in by index:

my_vector = c(1, 3, 5, 7)
result = rep(NA, length(my_vector))

for(i in seq_along(my_vector)) {
  result[i] = log(my_vector[i])
}

result
[1] 0.000000 1.098612 1.609438 1.945910
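
For comparison, the loop above and a functional can be set side by side. This is a sketch assuming purrr is installed; result_loop and result_map are names chosen here for illustration.

```r
library(purrr)

my_vector <- c(1, 3, 5, 7)

# For loop: preallocate the result, then fill it by index
result_loop <- rep(NA_real_, length(my_vector))
for (i in seq_along(my_vector)) {
  result_loop[i] <- log(my_vector[i])
}

# Functional: map_dbl() handles the allocation and the looping
result_map <- map_dbl(my_vector, log)

# Both approaches give the same numeric vector
all.equal(result_loop, result_map)
```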